The American Housing Survey is a regular survey conducted by the US Census in coordination with the US Department of Housing and Urban Development that asks questions about housing and affordability. Here is what the US Census says about the AHS:
The American Housing Survey (AHS) is sponsored by the Department of Housing and Urban Development (HUD) and conducted by the U.S. Census Bureau. The survey has been the most comprehensive national housing survey in the United States since its inception in 1973, providing current information on the size, composition, and quality of the nation’s housing and measuring changes in our housing stock as it ages. The AHS is a longitudinal housing unit survey conducted biennially in odd-numbered years, with samples redrawn in 1985 and 2015 .
The survey provides up-to-date information about the quality and cost of housing in the United States and major metropolitan areas. The survey also includes questions about:
- the physical condition of homes and neighborhoods,
- the costs of financing and maintaining homes, and
- the characteristics of people who live in these homes.
- Planners, policy makers, and community stakeholders use the results of the AHS to assess the housing needs of communities and the country. These statistics inform decisions that affect the housing opportunities for people of all income levels, ages, and racial and ethnic groups.
Since our country changes rapidly, policymakers in government and private organizations need current housing information to make decisions about programs that will affect people of all income levels, ages, and racial and ethnic groups.
In this project, we will use the 2021 Public Use Files. These files strip identifying information from the respondents, including location information such as the state or city. The data are also broken up into four individual files.
../data/ahs_2021/household.csv.gz: The distinct
households surveyed.../data/ahs_2021/person.csv.gz: People living in the
households (there may be more than one)../data/ahs_2021/mortgage.csv.gz: Mortgage (loan)
information, which does not apply to all households.../data/ahs_2021/project.csv.gz:You should become familiar with the codebook
for the version of the survey we are using. When you find variables,
you can expand the description by pressing the + button and
then clicking on the “Survey Year: 2021 National” option. Here is an
example describing the variable that serves as the primary key for each
table (each table has its own control number, one per row).
Load each of the files listed above into a table (I suggest using
household, project, etc., and further tasks
will mention these tables by name). While loading the data, pay
particular attention to the following:
import(path_to_table) function.JCOLUMN_NAME
where COLUMN_NAME is another column in the same table.
These variables indicate if there was any imputation or editing of
values. For our purposes we can ignore these columns. But there are also
a few columns that start with “JOB” that are valid columns.col_select argument to avoid importing these
columns.select with
-matches(...) to exclude the REPW columns
after importing them.as.numeric).'
followed by an optional “-” character, followed by 1 or more digits,
followed by a final “’” character. For example, it should match
'5' and '-6'. Use this pattern with
mutate_if (see also str_detect and
all) and your function from the previous point to change
all of these character columns into numeric values.NA. Apply this function to
any numeric columns in each table (again the mutate_if
function can be helpful here).household: BLD, BATHROOMS,
DIVISION, HOA (true/false)person: RACEmortgage: LOANTYPEis_num<-function(x){
all(str_detect(x,"(^'[0-9])|('-)"))
}
strip_and_convert <- function(x) {
d<-(gsub("(^')|('$)", "", x))
as.numeric(d)
}
is_na <- function(x) {
ifelse((x==-9|x==-6),NA,x)
}
import<-function(file){
read_csv(file=file,col_select =matches("^JOB|^[^J]")) %>%
select(-matches("REPW")) %>%
mutate_if(is_num,strip_and_convert) %>% mutate_if(is.numeric,is_na)
}
project <- import('../data/ahs_2021/project.csv.gz')
mortgage <- import("../data/ahs_2021/mortgage.csv.gz")
household <- import("../data/ahs_2021/household.csv.gz")
person <- import("../data/ahs_2021/person.csv.gz")
dim(project)
## [1] 51476 8
dim(mortgage)
## [1] 19155 19
dim(household)
## [1] 64141 305
dim(person)
## [1] 135926 49
Recoded data:
household<-household %>% mutate(BLD=recode(BLD,
'0'='Mobile home or trailer',
'1'='One-family house, detached',
'2'='One-family house, attached',
'3'='Mobile home or trailer',
'4'='2 apartments',
'5'='3 to 4 apartments',
'6'='5 to 9 apartments',
'7'='10 to 19 apartments',
'8'='20 to 49 apartments',
'9'='50 or more apartments',
'10'='Boat, RV, van, etc.'
),
BATHROOMS=recode(BATHROOMS,
'1'='One full bathroom',
'2'='1.5 bathrooms',
'3'='2 bathrooms',
'4'= '2.5 bathrooms',
'5'= '3 bathrooms',
'6'= 'More than 3 bathrooms',
'7'= 'No full bath: sink and tub present',
'8'= 'No full bath: sink and toilet present',
'9'= 'No full bath: tub and toilet present',
'10'= 'No full bath: sink only',
'11'= 'No full bath: tub only',
'12'= 'No full bath: toilet only',
'13'= 'No full bath: no sink, tub or toilet'
),
DIVISION=recode(DIVISION,
'1' ='New England',
'2' = 'Middle Atlantic',
'3' = 'East North Central',
'4' = 'West North Central',
'5' = 'South Atlantic',
'6' = 'East South Central',
'7' = 'West South Central',
'8' = 'Mountain',
'9' = 'Pacific'
),
HOA=ifelse(HOA==1,'true','false')
)
person<-person %>%
mutate(
RACE=recode(RACE,
'1'="White only",'2'='Black only',
'3'='American Indian, Alaska Native only',
'4'='Asian only',
'5'='Hawaiian, Pacific Islander only',
'6'='White / Black','7'='White / American Indian, Alaska Native',
'8'='White / Asian',
'9'='White / Hawaiian, Pacific Islander',
'10'='Black / American Indian, Alaska Native',
'11'='Black / Asian',
'12'='Black / Hawaiian, Pacific Islander',
'13'='American Indian, Alaska Native / Asian',
'14'='Asian / Hawaiian, Pacific Islander',
'15'='White / Black / American Indian, Alaska Native',
'16'='White / Black / Asian',
'17'='White / American Indian, Alaska Native / Asian',
'18'= 'White / Asian / Hawaiian, Pacific Islander',
'19'= 'White / Black / American Indian, Alaska Native / Asian',
'20'= 'Other combinations of 2 or 3 races',
'21'= 'Other combinations of 4 or more races'
))
mortgage<-mortgage %>%
mutate(LOANTYPE=recode(LOANTYPE,
'1'='Primary loan','2'='Secondary loan','3'='Home equity line of credit'
))
DIVISIONYRBUILT
(hint: what kind of variable is this? See the codebook.)MARKETVAL) typically higher for
households that have a homeowner’s association (HOA)?UNITSIZE values with the
midpoint of the range. How does the number of bedrooms change with
larger homes?DIVISION.household %>%
ggplot(aes(y=factor(DIVISION,levels = names(sort(table(DIVISION))))))+
geom_bar(fill=4,col=1)+labs(title = "Regional quantity distribution",
y="DIVISION")
ggplot(household,aes(x=YRBUILT))+
geom_histogram(fill=4,col=1)
#Use categorical variables to remove the interval
ggplot(household,aes(y=factor(YRBUILT)))+
geom_bar(fill=4,col=1)
household %>% filter(!is.na(HOA))%>%group_by(HOA) %>%
ggplot(aes(y = MARKETVAL, x = HOA)) +
stat_summary(
fun.min = min,
fun.max = max,
fun = median
)
household %>% filter(!is.na(HOA))%>%group_by(HOA) %>% summarise(mean=mean(MARKETVAL,na.rm = T),sd=sd(MARKETVAL,na.rm = T))
## # A tibble: 2 × 3
## HOA mean sd
## <chr> <dbl> <dbl>
## 1 false 418583. 548318.
## 2 true 497148. 556395.
household %>% filter(!is.na(HOA)) %>% ggplot(aes(y=MARKETVAL,x=HOA,fill=HOA))+geom_violin()
household %>% filter(!is.na(HOA)) %>%ggplot(aes(y=MARKETVAL,x=HOA,fill=HOA))+geom_boxplot(width=.3,position=position_dodge(width=0.9))+ylim(0,2*10^6)
household2<- household %>%
mutate(UNITSIZE_midpoint=recode(UNITSIZE,
'1'=250,
'2'=625,
'3'=875,
'4'=1250,
'5'=1750,
'6'=2250,
'7'=2750,
'8'=3500,
'9'=4000
))
household2%>% ggplot(aes(y=BEDROOMS,x=UNITSIZE_midpoint)) +geom_point(position = "jitter",alpha=0.4,size=0.05,col=1)+geom_smooth(method = "lm",col='red')
household2%>% filter(!is.na(UNITSIZE_midpoint))%>% ggplot(aes(y=factor(BEDROOMS),x=factor(UNITSIZE_midpoint))) +geom_bin_2d()+scale_fill_gradient(low = "lightblue",high = "red")
group_by and summarize to investigate
a variable not used as a grouping factor in the required sections above.
Write up your findings in written form (3 to 5 sentences).DIVISION column?)?
Define how you will define “affordable” and explain how will you choose
to select a region using your measurement of affordability. Implement
affordability using mutate. Compare the regions and explain
your results.household %>% group_by(BATHROOMS) %>% summarise(n=n()) %>% arrange(-n) %>% mutate(percentage =paste0(round(n/sum(n)*100,3),"%"))
## # A tibble: 11 × 3
## BATHROOMS n percentage
## <chr> <int> <chr>
## 1 One full bathroom 23872 37.218%
## 2 2 bathrooms 18235 28.43%
## 3 2.5 bathrooms 7958 12.407%
## 4 1.5 bathrooms 6426 10.019%
## 5 3 bathrooms 6008 9.367%
## 6 More than 3 bathrooms 1495 2.331%
## 7 No full bath: no sink, tub or toilet 129 0.201%
## 8 No full bath: sink and toilet present 9 0.014%
## 9 No full bath: toilet only 5 0.008%
## 10 No full bath: tub and toilet present 3 0.005%
## 11 No full bath: sink and tub present 1 0.002%
household %>% group_by(BATHROOMS) %>% summarise(u=mean(MARKETVAL,na.rm=T)) %>% ggplot(aes(x=u,y=BATHROOMS))+geom_point()+geom_line()
household %>%ggplot(aes(x=MARKETVAL,y=BATHROOMS))+geom_violin(fill=4)
household %>%group_by(DIVISION) %>% summarise(expend=median(TOTHCAMT,na.rm = T),income=median(FINCP,na.rm = T)) %>% mutate(Disposable_income=income-expend,p=expend/income) %>% arrange(p)
## # A tibble: 9 × 5
## DIVISION expend income Disposable_income p
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 West North Central 879 48000 47121 0.0183
## 2 East North Central 958 48600 47642 0.0197
## 3 Mountain 1200. 55000 53800. 0.0218
## 4 West South Central 1094 50000 48906 0.0219
## 5 East South Central 680 30700 30020 0.0221
## 6 South Atlantic 1218 52000 50782 0.0234
## 7 Pacific 1656 66950 65294 0.0247
## 8 Middle Atlantic 1181 46780 45599 0.0252
## 9 New England 1530 60000 58470 0.0255
household %>% filter(!ELECAMT%in%c(0:4)&!GASAMT%in%c(0:4))%>% ggplot(aes(y=ELECAMT,x=GASAMT ))+geom_point(size=0.05,col=2)+geom_smooth()
PERSONID variable. How is variable
different from CONTROL? If you want to find out how many
people live in a household, how can you do that? What is the maximum
number of people living in a household in this data set?CONTROL represents the head of the household, and each
head may have multiple members of the PERSONID living in a
room. The largest family was 19 peopleperson %>% group_by(PERSONID)%>% summarise(n=n())
## # A tibble: 135,926 × 2
## PERSONID n
## <dbl> <int>
## 1 11000005001 1
## 2 11000005002 1
## 3 11000007005 1
## 4 11000007006 1
## 5 11000007007 1
## 6 11000010002 1
## 7 11000010003 1
## 8 11000013001 1
## 9 11000013002 1
## 10 11000014003 1
## # ℹ 135,916 more rows
person %>% group_by(CONTROL)%>% summarise(n=n()) %>% arrange(-n)
## # A tibble: 56,058 × 2
## CONTROL n
## <dbl> <int>
## 1 11025555 19
## 2 11046927 16
## 3 11037862 13
## 4 11072988 13
## 5 11082642 13
## 6 11025451 12
## 7 11025524 12
## 8 11068752 12
## 9 11075513 12
## 10 11084358 12
## # ℹ 56,048 more rows
age<-person %>% group_by(CONTROL) %>% summarise(NUM=n(),min_age=min(AGE),median_age=median(AGE),max_age=max(AGE),mean=(min_age+median_age+min_age)/3)
age
## # A tibble: 56,058 × 6
## CONTROL NUM min_age median_age max_age mean
## <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 11000005 2 73 73.5 74 73.2
## 2 11000007 3 22 55 55 33
## 3 11000010 2 55 67.5 80 59.2
## 4 11000013 2 60 60.5 61 60.2
## 5 11000014 3 10 38 52 19.3
## 6 11000017 3 13 43 49 23
## 7 11000018 1 21 21 21 21
## 8 11000019 1 80 80 80 80
## 9 11000021 6 4 16 38 8
## 10 11000022 1 70 70 70 70
## # ℹ 56,048 more rows
The person table contains several categorical variables. Using grouping on more than variable to explore the relationships between these variables. For example, for each level of variable A, what is the most common value of variable B? Write up what you find.
MLPE served in the Vietnam War, most people did not
answer whether they participated in the Vietnam War, should be not
understand
The percentage of Hispanics who did not serve in the Vietnam War was higher than the percentage of Hispanics who served in the Vietnam War
person %>% mutate(MLPE=recode(MLPE,"1"='yes',"2"='no'),
SPAN=recode(SPAN,"1"='yes',"2"='no'))%>%
group_by(MLPE) %>% count(SPAN)%>%
mutate(p=paste0(round(n/sum(n)*100,2),"%"))%>% arrange(-n,.by_group=T)
## # A tibble: 6 × 4
## # Groups: MLPE [3]
## MLPE SPAN n p
## <chr> <chr> <int> <chr>
## 1 no no 4230 91.42%
## 2 no yes 397 8.58%
## 3 yes no 2363 95.01%
## 4 yes yes 124 4.99%
## 5 <NA> no 100230 77.81%
## 6 <NA> yes 28582 22.19%
PMTAMT and interpret
the results.AMMORT contains
information on the amount of the mortgage for primary and secondary
types. With home equity lines of credit (HELOC) loans, home owners have
a credit limit that they could borrow at any time, which is contained in
the HELOCLIM column. Create a new column that merges
AMMROT and HELOCLIM to give a picture how much
money has or could be borrowed. Plot the conditional distribution of
this new column for each of the LOANTYPE categories.
Compare both typical values and variation. Write up a 2 or 3 sentences
explaining your results.there are more than one mortgage,like Primary loan and Secondary loan.And the families shown below have three loans
mortgage %>%count(CONTROL) %>% arrange(-n) %>% head(5)
## # A tibble: 5 × 2
## CONTROL n
## <dbl> <int>
## 1 11005033 3
## 2 11005360 3
## 3 11015764 3
## 4 11023551 3
## 5 11032795 3
mortgage %>% ggplot(aes(PMTAMT))+geom_histogram(bins=100,fill=2,col=1)+xlim(0,15000)
mortgage<-mortgage %>% mutate(both_amount=ifelse(LOANTYPE%in%c('Primary loan','Secondary loan'),AMMORT,HELOCLIM))
mortgage %>% ggplot(aes(x=both_amount,y=LOANTYPE))+geom_boxplot()
mortgage %>% ggplot(aes(x=both_amount,y=LOANTYPE))+geom_boxplot()+xlim(0,2.5*10^6)
REFI). Include at least one plot.the loan amount for refinancing was higher on mean and median.
mortgage %>% filter(!is.na(REFI))%>% mutate(REFI=recode(REFI,'1'="yes",'2'="no"))%>% ggplot(aes(y=both_amount,x=REFI,fill=REFI))+geom_boxplot()+ylim(0,2.5*10^6)
mortgage %>% filter(!is.na(REFI))%>% mutate(REFI=recode(REFI,'1'="yes",'2'="no")) %>% group_by(REFI) %>% summarise(u=mean(both_amount),median=median(both_amount,na.rm = T))
## # A tibble: 2 × 3
## REFI u median
## <chr> <dbl> <dbl>
## 1 no 244919. 192902.
## 2 yes 270502. 215351
Primary loan types of loans that people carry the highest debt.
mortgage %>%group_by(CONTROL) %>% mutate(n=n()) %>% ungroup()%>% filter(n==1) %>% group_by(LOANTYPE) %>% summarise(mean=mean(both_amount,na.rm = T))
## # A tibble: 2 × 2
## LOANTYPE mean
## <chr> <dbl>
## 1 Home equity line of credit 151052.
## 2 Primary loan 257795.
We have now investigated the three main tables individually. Use these results to help you formulate three questions that we can ask after merging the data in the next section. For each question, be sure to state why the previous results caused you to ask this question.
The distribution of variables mainly analyzed above includes TOTHCAMT, MARKETVAL, and FINCP. we consider the condition distribution of different races to see if there is racial discrimination and other situations.Which race may have higher home values or household incomes
Because we looked at the level of housing consumption in different regions, and further considered the influence of race, which may affect normal life, we looked at whether there were differences in the number of major ethnic groups in different regions, their level of rent spending, and what is the total household income
Since we have studied the relationship between the number of bedrooms and the size of the house, different groups of people have different life preferences, and the race added to the person table is considered to study the preferences of different groups of people.Which races prefer to live in groups.
Write a paragraph describing the difference between these ways of
merging the household and mortgages table:
CONTROL columnCONTROL columnmortgages table to get total
mortgage amounts and payments, then just a left join households to the
aggregated mortgages table.If we would want to compare mortgage amounts for primary, second, and
HELOC loans for each region (DIVISON column)? Implement
this solution and use a facet plot to show the distributions of primary,
secondary, and HELOC loans by region.
A left join is used to match one mortgage per line of the household data.According to each symbol of the left table matches the corresponding right table, the number of rows in the 1 pair is increased, and the number of rows is retained when there is no match
An inner join is employed to retain rows that have a match, removing any rows that do not have a match. When there is a one-to-many relationship, the join operation will add rows accordingly.
The mortgages are first consolidated into a single loan column, allowing the CONTROL identifier to uniquely correspond to each household line. Each mortgage row is matched during the merge, resulting in a table that represents the head of the household with a mortgage
mortgage %>% group_by(CONTROL) %>%
left_join(household,mortgage,by='CONTROL')
## # A tibble: 19,155 × 324
## # Groups: CONTROL [18,108]
## CONTROL AMMORT REFI MORTCLASS MORTTYPE TAXPMT PMTONLY MORTLINE REFICSHAMT
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 11000007 250364 2 1 1 1 1629 1 NA
## 2 11000017 152441 2 1 1 1 753 1 NA
## 3 11000017 203067 2 1 1 2 1073 2 NA
## 4 11000042 383327 1 1 1 1 1395 1 NA
## 5 11000048 128553 2 1 1 1 1322 1 NA
## 6 11000055 198463 1 1 1 1 1137 1 NA
## 7 11000062 33046 2 1 2 2 178 1 NA
## 8 11000078 165332 2 1 1 1 731 1 NA
## 9 11000078 NA NA 3 NA NA NA 2 NA
## 10 11000079 26324 1 1 1 1 119 1 NA
## # ℹ 19,145 more rows
## # ℹ 315 more variables: MORTADDTN <dbl>, HELOCLIM <dbl>, REFICSH <dbl>,
## # HELOCBAL <dbl>, HELOCADD <dbl>, PMTFREQ <dbl>, LOANTYPE <chr>,
## # MISCPMT <dbl>, PMTAMT <dbl>, INTRATE <dbl>, both_amount <dbl>,
## # TOTROOMS <dbl>, PERPOVLVL <dbl>, DBEVICLK <dbl>, DBEVICNOTE <dbl>,
## # DBEVICTHT <dbl>, DBEVICWHERE <dbl>, DBFORCWHR <dbl>, DBLVEFORC <dbl>,
## # PETSCAT <dbl>, PETSDOG <dbl>, PETSBIRD <dbl>, PETSFISH <dbl>, …
mortgage %>% group_by(CONTROL) %>%
inner_join(household,mortgage,by='CONTROL')
## # A tibble: 19,155 × 324
## # Groups: CONTROL [18,108]
## CONTROL AMMORT REFI MORTCLASS MORTTYPE TAXPMT PMTONLY MORTLINE REFICSHAMT
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 11000007 250364 2 1 1 1 1629 1 NA
## 2 11000017 152441 2 1 1 1 753 1 NA
## 3 11000017 203067 2 1 1 2 1073 2 NA
## 4 11000042 383327 1 1 1 1 1395 1 NA
## 5 11000048 128553 2 1 1 1 1322 1 NA
## 6 11000055 198463 1 1 1 1 1137 1 NA
## 7 11000062 33046 2 1 2 2 178 1 NA
## 8 11000078 165332 2 1 1 1 731 1 NA
## 9 11000078 NA NA 3 NA NA NA 2 NA
## 10 11000079 26324 1 1 1 1 119 1 NA
## # ℹ 19,145 more rows
## # ℹ 315 more variables: MORTADDTN <dbl>, HELOCLIM <dbl>, REFICSH <dbl>,
## # HELOCBAL <dbl>, HELOCADD <dbl>, PMTFREQ <dbl>, LOANTYPE <chr>,
## # MISCPMT <dbl>, PMTAMT <dbl>, INTRATE <dbl>, both_amount <dbl>,
## # TOTROOMS <dbl>, PERPOVLVL <dbl>, DBEVICLK <dbl>, DBEVICNOTE <dbl>,
## # DBEVICTHT <dbl>, DBEVICWHERE <dbl>, DBFORCWHR <dbl>, DBLVEFORC <dbl>,
## # PETSCAT <dbl>, PETSDOG <dbl>, PETSBIRD <dbl>, PETSFISH <dbl>, …
tab<-mortgage %>% group_by(CONTROL) %>%
mutate(sum=sum(both_amount))%>% ungroup()%>%
left_join(household,by='CONTROL')
mortgage %>% left_join(household,by='CONTROL') %>%
ggplot(aes(x=both_amount,y=DIVISION))+geom_violin()+
facet_wrap(~LOANTYPE,ncol=1,scales = 'free')
FINCP against the total mortgage amount. Comment on the
results.AMMORT and HELOCLIM. Suppose this column is
called both_amount. Using pivoting, create columns for the
both_amount_primary, both_amount_secondary,
and both_amount_heloc. For mortgages with both primary and
HELOC mortgages, plot the joint distribution of these values. You will
probably need to group by CONTROL after pivoting to get
totals.The fitting line shows a positive relationship
tab %>% ggplot(aes(FINCP,sum))+
geom_point()+
geom_smooth()
tab %>%
pivot_wider(names_from =LOANTYPE,
values_from = both_amount)%>%
group_by(CONTROL) %>%
summarise(both_amount_primary=sum(`Primary loan`,na.rm = T),
both_amount_secondary=sum(`Secondary loan`,na.rm = T),
both_amount_heloc=sum(`Home equity line of credit`,na.rm = T))
## # A tibble: 18,108 × 4
## CONTROL both_amount_primary both_amount_secondary both_amount_heloc
## <dbl> <dbl> <dbl> <dbl>
## 1 11000007 250364 0 0
## 2 11000017 152441 203067 0
## 3 11000042 383327 0 0
## 4 11000048 128553 0 0
## 5 11000055 198463 0 0
## 6 11000062 33046 0 0
## 7 11000078 165332 0 0
## 8 11000079 26324 0 0
## 9 11000082 170943 0 0
## 10 11000088 260840 0 0
## # ℹ 18,098 more rows
group_by/summarize or an iteration technique
to address your question.left_join(household,person,by='CONTROL')
## # A tibble: 144,009 × 353
## CONTROL TOTROOMS PERPOVLVL DBEVICLK DBEVICNOTE DBEVICTHT DBEVICWHERE
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 11000005 8 465 NA NA NA NA
## 2 11000005 8 465 NA NA NA NA
## 3 11000007 8 501 NA NA NA NA
## 4 11000007 8 501 NA NA NA NA
## 5 11000007 8 501 NA NA NA NA
## 6 11000009 5 NA NA NA NA NA
## 7 11000010 5 12 3 NA 2 1
## 8 11000010 5 12 3 NA 2 1
## 9 11000013 7 501 NA NA NA NA
## 10 11000013 7 501 NA NA NA NA
## # ℹ 143,999 more rows
## # ℹ 346 more variables: DBFORCWHR <dbl>, DBLVEFORC <dbl>, PETSCAT <dbl>,
## # PETSDOG <dbl>, PETSBIRD <dbl>, PETSFISH <dbl>, PETSHH <dbl>,
## # PETSREPT <dbl>, PETSSMAM <dbl>, RENT <dbl>, WFRALERT <dbl>, WFRROOF <dbl>,
## # WFRSIDING <dbl>, WFRWFENCE <dbl>, WFRWPILE <dbl>, WFRWSHED <dbl>,
## # BATHEXCLU <dbl>, DISHWASH <dbl>, SOLAR <dbl>, GARAGE <dbl>, DINING <dbl>,
## # LAUNDY <dbl>, NOSTEP <dbl>, GUTREHB <dbl>, CONDO <dbl>, SEARCHCRED <dbl>, …
left_join(household,person,by='CONTROL') %>% group_by(RACE) %>% summarise(TOTHCAMT1=median(TOTHCAMT),MARKETVAL2=median(MARKETVAL,na.rm=T),FINCP3=median(FINCP)) %>% arrange(-MARKETVAL2)
## # A tibble: 21 × 4
## RACE TOTHCAMT1 MARKETVAL2 FINCP3
## <chr> <dbl> <dbl> <dbl>
## 1 American Indian, Alaska Native / Asian 1967 931263 83010
## 2 Black / Hawaiian, Pacific Islander 1344. 758329 36850
## 3 Asian / Hawaiian, Pacific Islander 1909 668180 122500
## 4 Asian only 1955 556104 100000
## 5 White / Asian / Hawaiian, Pacific Islander 1741 548172 89200
## 6 White / Asian 2138 514670. 125000
## 7 Other combinations of 2 or 3 races 1600 504230. 47000
## 8 White / American Indian, Alaska Native / Asian 2293 487500 86000
## 9 White / Black / Asian 1907 460844 139000
## 10 White / Hawaiian, Pacific Islander 1834 451128 103800
## # ℹ 11 more rows
left_join(household,person,by='CONTROL') %>% pivot_longer(col=c(TOTHCAMT,MARKETVAL,FINCP),names_to = "vv") %>% ggplot(aes(value,RACE,fill=vv))+geom_violin()+facet_grid(~vv,scales = "free")+theme(legend.position = 'none')
left_join(household,person,by='CONTROL') %>%
filter(!is.na(RACE)) %>%
group_by(DIVISION,RACE) %>% summarise(n=n(),mTOTHCAMT=median(TOTHCAMT,na.rm=T),mFINCP=median(FINCP,na.rm=T)) %>% group_by(DIVISION) %>%
arrange(-n,.by_group = T) %>%
slice_head(n=3)
## # A tibble: 27 × 5
## # Groups: DIVISION [9]
## DIVISION RACE n mTOTHCAMT mFINCP
## <chr> <chr> <int> <dbl> <dbl>
## 1 East North Central White only 13325 1100 69600
## 2 East North Central Black only 2963 940 30000
## 3 East North Central Asian only 848 1470 89002
## 4 East South Central White only 3765 820 50000
## 5 East South Central Black only 1665 749 22100
## 6 East South Central Asian only 81 1539 77050
## 7 Middle Atlantic White only 9935 1440 70000
## 8 Middle Atlantic Black only 3021 1065 32000
## 9 Middle Atlantic Asian only 1082 1807 89700
## 10 Mountain White only 8063 1354 69000
## # ℹ 17 more rows
household2 %>% left_join(person,by='CONTROL') %>% group_by(RACE) %>% summarise(n=n(),BEDROOMS=mean(BEDROOMS,na.rm = T),UNITSIZE=mean(UNITSIZE_midpoint,na.rm = T)) %>% arrange(-n)
## # A tibble: 21 × 4
## RACE n BEDROOMS UNITSIZE
## <chr> <int> <dbl> <dbl>
## 1 White only 97395 3.02 1853.
## 2 Black only 22358 2.79 1546.
## 3 Asian only 10491 3.13 1995.
## 4 <NA> 8083 2.28 1371.
## 5 American Indian, Alaska Native only 1809 2.87 1552.
## 6 White / Black 1103 2.98 1587.
## 7 White / American Indian, Alaska Native 723 2.85 1621.
## 8 White / Asian 717 3.27 2053.
## 9 Hawaiian, Pacific Islander only 683 3.01 1751.
## 10 Black / American Indian, Alaska Native 183 2.76 1605.
## # ℹ 11 more rows
household2 %>% left_join(person,by='CONTROL') %>%filter(RACE%in%c('White only','Black only','Asian only')) %>% ggplot(aes(y=BEDROOMS,x=UNITSIZE_midpoint,col=RACE)) +geom_point(position = "jitter",alpha=0.6,size=0.5)+geom_smooth(method = "lm",col='red')